OCPBUGS-84308: fix(cpo) delete terminated MCD pods to retry in-place upgrades by PoornimaSingour · Pull Request #8434 · openshift/hypershift

PoornimaSingour · 2026-05-06T03:57:30Z

What this PR does / why we need it:

When an in-place MachineConfig daemon pod is prematurely terminated (e.g., by a forced node drain), it may transition to Succeeded or Failed phase without having completed the configuration update. Previously, reconcileUpgradePods did not check the pod's phase when it already existed, leaving the terminated pod in place and causing the upgrade to stall indefinitely.

Now, when an MCD pod exists in a terminal phase (Succeeded or Failed) on a node that still requires upgrading, the controller deletes the pod so it is recreated on the next reconciliation cycle.

Which issue(s) this PR fixes:

Fixes: https://redhat.atlassian.net/browse/OCPBUGS-84308

Special notes for your reviewer:

Checklist:

Subject and description added to both, commit and PR.
Relevant issues have been referenced.
This change includes docs.
This change includes unit tests.

Summary by CodeRabbit

Bug Fixes
- Upgrade flow now removes terminated upgrade pods (Succeeded/Failed) so retries can proceed and in-place upgrades continue after prior attempts finish.
- Error reporting and logging around upgrade pod reconciliation improved for clearer operational visibility.
Tests
- Added unit tests covering upgrade pod lifecycle: deletion of terminated pods, retention of running pods, creation when missing, skip on terminating pods, and cleanup of idle pods.

openshift-merge-bot · 2026-05-06T03:57:32Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

openshift-ci · 2026-05-06T03:57:34Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai · 2026-05-06T03:57:44Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

reconcileInPlaceUpgrade now returns errors from reconcileUpgradePods as “failed to reconcile upgrade pods”. reconcileUpgradePods was extended to detect upgrade Machine Config Daemon pods in Succeeded or Failed phases when the corresponding node still needs an upgrade; it logs the detection, deletes the terminated pod (ignoring NotFound), and relies on subsequent reconciles to recreate the pod. Existing behavior for Running pods, creating missing pods, and deleting idle pods for fully updated nodes is covered by a new TestReconcileUpgradePods unit test.

Sequence Diagram(s)

sequenceDiagram
    participant Controller as Controller
    participant API_Server as API Server
    participant Node as Node
    participant Pod as Upgrade Pod

    Controller->>API_Server: Get upgrade Pod for node
    API_Server-->>Controller: Return Pod (Running | Succeeded | Failed | NotFound)

    alt Pod is Running
        Controller->>Controller: Leave Pod unchanged
    else Pod is Succeeded or Failed and Node needs upgrade
        Controller->>API_Server: Log detection and Delete Pod
        API_Server-->>Controller: Delete response (Success / NotFound / Error)
        Controller->>Controller: Retry path will recreate pod later
    else Pod NotFound
        Controller->>API_Server: Create upgrade Pod
        API_Server-->>Controller: Create response (Success / Error)
    end
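
The branches in the diagram above can be sketched as a pure decision function. This is a minimal illustration, not the code from the PR: `decideUpgradePodAction`, `Action`, and the local `PodPhase` type are hypothetical names, and the ordering of the "terminating" and "idle cleanup" checks is an assumption based on the walkthrough and test summary.

```go
package main

import "fmt"

// PodPhase mirrors the string values of corev1.PodPhase. It is redefined
// here so the sketch compiles without Kubernetes dependencies.
type PodPhase string

const (
	PodPending   PodPhase = "Pending"
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

// Action is a hypothetical name for what the controller does with a
// node's upgrade pod during one reconcile pass.
type Action string

const (
	ActionNone   Action = "none"
	ActionCreate Action = "create"
	ActionDelete Action = "delete"
)

// decideUpgradePodAction is a hypothetical helper, not a function from the PR.
// exists reports whether the upgrade pod was found; terminating reports
// whether it already carries a deletion timestamp.
func decideUpgradePodAction(nodeNeedsUpgrade, exists, terminating bool, phase PodPhase) Action {
	if !exists {
		if nodeNeedsUpgrade {
			return ActionCreate // missing pod on a node that still needs the upgrade
		}
		return ActionNone
	}
	if terminating {
		return ActionNone // deletion already in progress, skip (assumed ordering)
	}
	if !nodeNeedsUpgrade {
		return ActionDelete // idle pod on a fully updated node
	}
	if phase == PodSucceeded || phase == PodFailed {
		return ActionDelete // terminated before the node finished upgrading
	}
	return ActionNone // Pending or Running: leave the pod alone
}

func main() {
	fmt.Println(decideUpgradePodAction(true, true, false, PodFailed))  // delete
	fmt.Println(decideUpgradePodAction(true, true, false, PodRunning)) // none
	fmt.Println(decideUpgradePodAction(true, false, false, ""))        // create
}
```

Keeping the decision separate from the API calls makes each branch easy to cover in a table-driven test, which matches the style of test the review describes.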

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (11 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main change: deleting terminated MCD pods to allow in-place upgrades to retry, which directly matches the core functionality described in the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	All test names in TestReconcileUpgradePods are stable and deterministic. No dynamic content, UUIDs, timestamps, or generated identifiers appear in test titles. Test data in bodies, not names.
Test Structure And Quality	✅ Passed	Check for Ginkgo, code uses Go testing. Table-driven single responsibility. Fake clients - no cleanup. Context correct. Assertions meaningful. Follows codebase patterns.
Microshift Test Compatibility	✅ Passed	TestReconcileUpgradePods is a standard Go unit test, not Ginkgo e2e. The check applies only to Ginkgo e2e tests.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	The PR adds only standard Go unit tests (TestReconcileUpgradePods), not Ginkgo e2e tests. The custom check applies specifically to new Ginkgo e2e tests. No Ginkgo imports or test markers present.
Topology-Aware Scheduling Compatibility	✅ Passed	PR modifies pod termination handling only. No new scheduling constraints introduced. Uses hostname nodeSelector (necessary for target node) and existing wildcard toleration.
Ote Binary Stdout Contract	✅ Passed	No OTE Stdout Contract violations. Files are controller/test code with no process-level entry points or stdout writes.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	TestReconcileUpgradePods is a standard Go unit test with fake clients, not a Ginkgo e2e test. The custom check applies only to Ginkgo e2e tests, making it not applicable.

openshift-ci-robot · 2026-05-06T03:59:25Z

@PoornimaSingour: This pull request references Jira Issue OCPBUGS-84308, which is invalid:

expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

When an in-place MachineConfig daemon pod is prematurely terminated (e.g., by a forced node drain), it may transition to Succeeded or Failed phase without having completed the configuration update. Previously, reconcileUpgradePods did not check the pod's phase when it already existed, leaving the terminated pod in place and causing the upgrade to stall indefinitely.

Now, when an MCD pod exists in a terminal phase (Succeeded or Failed) on a node that still requires upgrading, the controller deletes the pod so it is recreated on the next reconciliation cycle.

Assisted-by: Claude Opus 4.6 noreply@anthropic.com

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes : https://redhat.atlassian.net/browse/OCPBUGS-84308

Special notes for your reviewer:

Checklist:

Subject and description added to both, commit and PR.

Relevant issues have been referenced.

This change includes docs.

This change includes unit tests.

Summary by CodeRabbit

Bug Fixes

Improved handling of terminated upgrade pods to enable retry mechanisms during in-place upgrades, allowing progress even when previous upgrade attempts have completed.

Tests

Added comprehensive test coverage for upgrade pod lifecycle management across multiple scenarios, including pod deletion, retention, creation, and cleanup operations.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot · 2026-05-06T03:59:26Z

@PoornimaSingour: This pull request references Jira Issue OCPBUGS-84308, which is invalid:

expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

codecov · 2026-05-06T04:01:20Z

Codecov Report

❌ Patch coverage is 59.09091% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.32%. Comparing base (8c162c4) to head (9092cca).
⚠️ Report is 88 commits behind head on main.

Files with missing lines	Patch %	Lines
...tor/controllers/inplaceupgrader/inplaceupgrader.go	59.09%	6 Missing and 3 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #8434      +/-   ##
==========================================
+ Coverage   40.69%   41.32%   +0.63%     
==========================================
  Files         755      755              
  Lines       93373    93446      +73     
==========================================
+ Hits        37994    38618     +624     
+ Misses      52646    52081     -565     
- Partials     2733     2747      +14

Files with missing lines	Coverage Δ
...tor/controllers/inplaceupgrader/inplaceupgrader.go	`68.71% <59.09%> (+9.68%)`	⬆️

... and 35 files with indirect coverage changes

Flag	Coverage Δ
cmd-support	`34.86% <ø> (+0.16%)`	⬆️
cpo-hostedcontrolplane	`43.50% <ø> (+1.69%)`	⬆️
cpo-other	`43.21% <59.09%> (+1.82%)`	⬆️
hypershift-operator	`51.00% <ø> (+0.16%)`	⬆️
other	`31.64% <ø> (+0.02%)`	⬆️

coderabbitai

🧹 Nitpick comments (1)

control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.go (1)

736-738: ⚡ Quick win

Tighten deleted-pod assertion to NotFound instead of any error.

HaveOccurred() can pass for unrelated failures. Asserting IsNotFound makes the test intent explicit and failures clearer.

Proposed test hardening

+import apierrors "k8s.io/apimachinery/pkg/api/errors"
...
 			if tc.expectPodDeleted {
 				g.Expect(getErr).To(HaveOccurred(), "expected pod to be deleted")
+				g.Expect(apierrors.IsNotFound(getErr)).To(BeTrue(), "expected pod get to return NotFound after deletion")
 			}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.go`
around lines 736 - 738, Replace the loose assertion
g.Expect(getErr).To(HaveOccurred()) for deleted pods with a NotFound-specific
check: import k8s.io/apimachinery/pkg/api/errors as apierrors (or errors alias
used elsewhere) and replace the assertion with
g.Expect(apierrors.IsNotFound(getErr)).To(BeTrue(), "expected pod to be
NotFound") when tc.expectPodDeleted is true, referencing the tc.expectPodDeleted
branch and the getErr variable so the test fails only for a NotFound error.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 21221ab1-79e2-4d7c-8429-c9fb954b5229

📥 Commits

Reviewing files that changed from the base of the PR and between 7ac2953 and 6fbc013.

📒 Files selected for processing (2)

control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader.go
control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.go

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.go`:
- Around line 762-764: The test currently checks for a deleted pod using a broad
error assertion (g.Expect(getErr).To(HaveOccurred()) when tc.expectPodDeleted is
true); change this to assert specifically that the error is a NotFound error by
using the Kubernetes API errors helper on the getErr variable (e.g., assert
apierrors.IsNotFound(getErr) via the testing framework) so the deleted-pod
branch only passes for NotFound and not for other client/read errors; locate the
check guarded by tc.expectPodDeleted in inplaceupgrader_test.go and replace the
HaveOccurred() assertion with a specific IsNotFound assertion referencing
getErr.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 8b8c3bd2-d52a-403e-a20b-a5bbb89688c9

📥 Commits

Reviewing files that changed from the base of the PR and between 6fbc013 and b5637a4.

📒 Files selected for processing (1)

control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.go

PoornimaSingour · 2026-05-06T10:15:13Z

/jira refresh

openshift-ci-robot · 2026-05-06T10:15:20Z

@PoornimaSingour: This pull request references Jira Issue OCPBUGS-84308, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (dhuynh@redhat.com), skipping review request.

PoornimaSingour · 2026-05-06T10:15:33Z

@coderabbitai full review

coderabbitai · 2026-05-06T10:15:45Z

✅ Actions performed

Full review triggered.

openshift-ci-robot · 2026-05-06T10:18:27Z

@PoornimaSingour: This pull request references Jira Issue OCPBUGS-84308, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (dhuynh@redhat.com), skipping review request.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader.go (1)
352-363: ⚡ Quick win

Deleted-pod retry has no requeue guarantee — upgrade may stall.

After the terminated pod is deleted, reconcileUpgradePods returns nil, reconcileInPlaceUpgrade returns nil, and Reconcile returns ctrl.Result{} (no requeue). Because the deletion doesn't mutate any node annotation, no node-watch event fires to trigger a follow-up reconciliation. If no other MachineSet event arrives, the replacement pod is never created and the upgrade stalls indefinitely — which is exactly the problem this PR is fixing.

Consider either propagating a boolean "needs requeue" flag back up through reconcileInPlaceUpgrade to Reconcile, or returning ctrl.Result{RequeueAfter: ...} whenever at least one pod was deleted:
💡 Sketch of the fix
-func (r *Reconciler) reconcileUpgradePods(...) error {
+func (r *Reconciler) reconcileUpgradePods(...) (bool, error) {
     ...
+    podDeleted := false
     ...
     } else if pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed {
         ...
         if err := hostedClusterClient.Delete(ctx, pod); err != nil {
             ...
-            return fmt.Errorf("error deleting terminated upgrade MCD pod for node %s: %w", node.Name, err)
+            return false, fmt.Errorf("error deleting terminated upgrade MCD pod for node %s: %w", node.Name, err)
         }
+        podDeleted = true
     }
     ...
-    return nil
+    return podDeleted, nil
 }
And in reconcileInPlaceUpgrade / Reconcile, propagate the flag to return ctrl.Result{RequeueAfter: 5 * time.Second}.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader.go`
around lines 352 - 363, reconcileUpgradePods currently deletes terminated
upgrade pods but returns nil which causes reconcileInPlaceUpgrade and Reconcile
to not requeue and the replacement pod may never be created; change
reconcileUpgradePods to return a (bool, error) or similar indicator (e.g.,
deletedPod bool) when it deletes at least one pod, update
reconcileInPlaceUpgrade to propagate that flag up, and have Reconcile return
ctrl.Result{RequeueAfter: 5 * time.Second} (or another short duration) whenever
the flag indicates a pod was deleted so the controller will immediately requeue
and create the replacement pod.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.go`:
- Around line 692-715: Update the test case that sets existingPod with a
DeletionTimestamp and Finalizers so it actually verifies the "skip" behavior
instead of just checking getErr; in the assertion block that currently checks
getErr (references variables existingPod, expectPodSkipped and the retrieved pod
variable), either assert that the retrieved pod's DeletionTimestamp is non-nil
(e.g., pod.DeletionTimestamp != nil) to prove we hit the skip path, or
replace/add a fake-client interceptor (WithInterceptorFuncs) to spy on Delete
and assert Delete was never called for that pod — do not rely solely on getErr.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader.go`:
- Around line 352-363: reconcileUpgradePods now deletes both idle and terminated
pods but the error wrap at the caller still says "failed to delete idle upgrade
pods", which is misleading; update the error wrapping at the call site that
wraps the error from hostedClusterClient.Delete (the delete call inside
reconcileUpgradePods) to use a neutral message like "failed to delete upgrade
pod for node %s" or include the pod phase/node context so failures deleting
terminated pods are accurately described; adjust the fmt.Errorf wrapper (the
existing "failed to delete idle upgrade pods" message) to reference the upgrade
pod deletion generically (or include pod.Status.Phase) so logs reflect the
actual deletion target.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 7df03c82-4975-43fe-9170-34a23bcc9534

📥 Commits

Reviewing files that changed from the base of the PR and between 7ac2953 and c82c543.

📒 Files selected for processing (2)

control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader.go
control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.go

PoornimaSingour · 2026-05-11T16:34:08Z

@coderabbitai review

coderabbitai · 2026-05-11T16:34:16Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

PoornimaSingour · 2026-05-28T14:58:57Z

@jparrill, I addressed all the comments and ran the E2E tests manually using a build of the quay.io/rhn_support_psingour/hypershift:OCPBUGS-84308-cpo image. Test results: https://drive.google.com/file/d/1JQ801Zs8x_WrnV-XNbAgU0zd06tvm80J/view?usp=drive_link

Test Scenarios Verified

Force-deleted a running MCD pod during an in-place upgrade: the controller detected the terminated pod and recreated it within ~30s, and the upgrade completed successfully.
Verified that a node with matching config annotations but MachineConfigDaemonState != Done is not prematurely marked as idle.
Confirmed that the controller requeues every ~30s during active upgrades (68 reconcile cycles observed), closing the gap for missed deletion events.
A clean in-place upgrade completed with zero false-positive "terminated upgrade pod" messages, so there is no regression.
After the upgrade reached the Done state, the idle MCD pod was automatically cleaned up via the deleteUpgradePodIfExists helper.
Multi-node test: force-deleted the MCD pod on node2 while node1 completed normally. Both nodes independently reached the Done state, showing that the continue in the loop doesn't skip subsequent nodes.

csrwng

Review of terminated MCD pod handling and helper extraction.

csrwng · 2026-06-01T21:03:42Z

 	return nil
 }

+func deleteUpgradePodIfExists(ctx context.Context, c client.Client, pod *corev1.Pod) error {


deleteUpgradePodIfExists duplicates support/k8sutil.DeleteIfNeeded — same Get/DeletionTimestamp/Delete/IsNotFound pattern. Consider using the existing utility instead of introducing a new helper.

…grades

When an in-place MCD upgrade pod terminates (Failed/Succeeded) but the node still needs an upgrade, the controller now deletes the terminated pod so a fresh one can be recreated on the next reconcile loop. A periodic requeue (upgradeRequeueInterval = 30s) ensures the controller re-evaluates nodes that still need upgrades rather than waiting for an external event.

Additionally:
- Extract deleteUpgradePodIfExists helper to reduce duplication across reconcileUpgradePods and deleteUpgradeManifests
- Add test coverage for PodPending phase, multi-node mixed states, NotFound on Delete, RequeueAfter assertion, and Delete failure scenarios

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace the local deleteUpgradePodIfExists helper with the shared k8sutil.DeleteIfNeeded utility to reduce duplication and improve consistency across the codebase.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jparrill · 2026-06-03T13:26:59Z

/approve

openshift-ci · 2026-06-03T13:30:09Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jparrill, PoornimaSingour

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [jparrill]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

csrwng · 2026-06-03T14:41:33Z

/lgtm

openshift-merge-bot · 2026-06-03T14:43:17Z

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

PoornimaSingour · 2026-06-08T08:28:00Z

/verified by me

Test Results - https://drive.google.com/drive/u/0/folders/1LqG7eXZuFFzwiQWHTzDbxgCTQLwZPWIL
All 6 scenarios passed

- Terminated MCD pod retry
- No premature Done
- 30s requeue safety net
- Clean upgrade, no regression
- Idle pod cleanup
- Multi-node mixed states with force-delete on one node

openshift-ci-robot · 2026-06-08T08:28:22Z

@PoornimaSingour: This PR has been marked as verified by me.

PoornimaSingour · 2026-06-08T08:36:04Z

/retest

openshift-ci · 2026-06-08T08:45:58Z

@PoornimaSingour: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci-robot · 2026-06-08T08:59:11Z

@PoornimaSingour: Jira Issue Verification Checks: Jira Issue OCPBUGS-84308
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-84308 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

openshift-merge-robot · 2026-06-09T11:27:26Z

Fix included in release 5.0.0-0.nightly-2026-06-09-022526

PoornimaSingour · 2026-06-11T07:35:08Z

/jira backport release-4.22

openshift-ci-robot · 2026-06-11T07:35:12Z

@PoornimaSingour: Failed to create backported issues: An error was encountered cloning bug for cherrypick for bug OCPBUGS-84308 on the Jira server at https://redhat.atlassian.net. No known errors were detected, please see the full error message for details.

Full error message.


request failed. Please analyze the request body for more details. Status code: 400: {"errorMessages":[],"errors":{"customfield_10980":"Field does not support update 'customfield_10980'","customfield_10978":"Field does not support update 'customfield_10978'","customfield_10979":"Field does not support update 'customfield_10979'"}}

Please contact an administrator to resolve this issue, then request a bug refresh with /jira refresh.


bradmwilliams · 2026-06-11T13:13:02Z

/jira backport release-4.22

openshift-ci-robot · 2026-06-11T13:14:13Z

@bradmwilliams: The following backport issues have been created:

OCPBUGS-88325 for branch release-4.22

Queuing cherrypicks to the requested branches to be created after this PR merges:
/cherrypick release-4.22


openshift-cherrypick-robot · 2026-06-11T13:15:26Z

@openshift-ci-robot: #8434 failed to apply on top of branch "release-4.22":

Applying: fix(cpo): delete terminated MCD pods and requeue to retry in-place upgrades
Using index info to reconstruct a base tree...
M	control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader.go
M	control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.go
Falling back to patching base and 3-way merge...
Auto-merging control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader.go
Auto-merging control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.go
CONFLICT (content): Merge conflict in control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Patch failed at 0001 fix(cpo): delete terminated MCD pods and requeue to retry in-place upgrades


openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 6, 2026

openshift-ci Bot added the do-not-merge/needs-area label May 6, 2026

openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release and removed do-not-merge/needs-area labels May 6, 2026

PoornimaSingour changed the title ~~fix(cpo): delete terminated MCD pods to retry in-place upgrades~~ OCPBUGS-84308: fix(cpo) delete terminated MCD pods to retry in-place upgrades May 6, 2026

openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels May 6, 2026

openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label May 6, 2026

coderabbitai Bot reviewed May 6, 2026

Comment thread ...ane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.go

PoornimaSingour force-pushed the OCPBUGS-84308 branch from b5637a4 to df176c0 Compare May 6, 2026 09:13

openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 6, 2026

coderabbitai Bot reviewed May 6, 2026

Comment thread ...ane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.go

Comment thread ...ol-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader.go

hypershift-jira-solve-ci Bot mentioned this pull request May 7, 2026

OCPBUGS-84528: clarify pull secret in-place update behavior and add CP watches #8327

Merged

4 tasks

PoornimaSingour marked this pull request as ready for review May 12, 2026 05:45

openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 12, 2026

PoornimaSingour force-pushed the OCPBUGS-84308 branch from 99fa86c to fea50de Compare May 26, 2026 13:30

csrwng reviewed Jun 1, 2026

PoornimaSingour and others added 2 commits June 3, 2026 16:42

PoornimaSingour force-pushed the OCPBUGS-84308 branch from fea50de to 9092cca Compare June 3, 2026 11:13

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 3, 2026

openshift-ci Bot assigned csrwng Jun 3, 2026

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 3, 2026

openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 8, 2026

openshift-merge-bot Bot merged commit ad3da77 into openshift:main Jun 8, 2026
42 checks passed

This was referenced Jun 8, 2026

OSASINFRA-4368, OCPBUGS-84114: Update CAPO to latest stable release #8687

Open

CNTRLPLANE-3600: Bump k8s to v0.36.1, controller-runtime to v0.24.1, CAPI to v1.12.8 #8695

Draft

hypershift-jira-solve-ci Bot mentioned this pull request Jun 8, 2026

CNTRLPLANE-3271: add External OIDC e2e tests for v2 framework #8674

Merged

4 tasks

PoornimaSingour mentioned this pull request Jun 12, 2026

OCPBUGS-88325: fix(cpo) delete terminated MCD pods to retry in-place upgrades #8729

Open
